1 Overview and set up

In this document, we would like to use an example project to show how we can write reports in RStudio using R Markdown. We use an example gene expression data set which includes gene expression values (n = 11900) for 10 samples, covering two conditions (Control and TGFb-treated). This data is a subset of an integrated data set from this paper. We also have a expression signatures, called TGFb-EMT signature generated in the same paper. Both the data subset and the signature are available from the singscore R/Bioconductor package.

The purpose of this project is to find samples that are more concordant with the TGFb-EMT signature. To do this, we need to use a gene-set scoring method and samples’ transcriptional profiles to score samples against this signature, and then compare their scores.

To score samples, we use the singscore method, which is available as an R/Bioconductor package. If you are interetsed in the method, you can check the workflow paper by Bhuva et al, using singscore to predict mutations in acute myeloid leukemia from transcriptomic signatures.

library(knitr)
library(DT)
library(plotly)
library(singscore)
library(SummarizedExperiment)
library(GSEABase)

2 Data and signature

The example data available through singscore package is called tgfb_expr_10_se and is a SummarizedExperiment object. This object can include three main components: assay (e.g. the gene expression matrix), rowData (e.g. data frame for gene information), and colData (e.g. data frame for sample information).


Looking at the head of the expression data, we see that samples are in columns, and each row represents expression of a gene, with Entrez IDs as row names.

datatable(assay(tgfb_expr_10_se)[1:20, 1:4])


We would like to add more annotation file to the samples; we add these to the colData slot of the tgfb_expr_10_se object.

tgfb_expr_10_se@colData$Replicate[grepl("R1", rownames(tgfb_expr_10_se@colData))] <- "Replicate 1"
tgfb_expr_10_se@colData$Replicate[grepl("R2", rownames(tgfb_expr_10_se@colData))] <- "Replicate 2"

tgfb_expr_10_se@colData$Study[grepl("D_", rownames(tgfb_expr_10_se@colData))] <- "Deshieri"
tgfb_expr_10_se@colData$Study[grepl("Hes_", rownames(tgfb_expr_10_se@colData))] <- "Hesling"
tgfb_expr_10_se@colData$Study[grepl("Hil_", rownames(tgfb_expr_10_se@colData))] <- "Hill"


We generate an interactive annotation table using DT package.

DT::datatable(data.frame(colData(tgfb_expr_10_se)), filter = "top")


We can modify the style of the table format pretty easily.

datatable(data.frame(colData(tgfb_expr_10_se))) %>%
  formatStyle("Treatment", backgroundColor = styleEqual(unique(colData(tgfb_expr_10_se)$Treatment), c("lightseagreen", "orange")))

The TGFb-EMT gene expression signature has up-regulated gene set as well as down-regulated gene-set. A sample with high evidence of TGFb-EMT signature shows high expression of the up-regulated genes and low expression of the down-regulated genes compared to samples with low evidence of this signature. Looking at the up- and down- regulated genes, we see that they are Entrez IDs.

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
## head of gene sets
head(geneIds(tgfb_gs_up))
## [1] "19"  "87"  "182" "241" "323" "406"
head(geneIds(tgfb_gs_dn))
## [1] "136" "220" "224" "288" "330" "360"

3 Score samples

3.1 Rank genes in single samples

We need to rank the expression data first.

rankedData <- rankGenes(expreMatrix = assay(tgfb_expr_10_se))
datatable(head(rankedData))

3.2 TGFb-EMT signature score

sigScore_tgfb <-
  simpleScore(
  rankData = rankedData,
  upSet = tgfb_gs_up,
  downSet = tgfb_gs_dn,
  subSamples = NULL,
  centerScore = T,
  dispersionFun = "MAD",
  knownDirection = T
  )

datatable(sigScore_tgfb, filter = "top")

3.2.1 Scores vs dispersion

We visualise scores in samples with the highest and lowest scores.

plotDispersion(scoredf = sigScore_tgfb, 
  annot = colData(tgfb_expr_10_se)$Treatment, 
  annot_name = "Group", 
  isInteractive = T, textSize = 0.5)

3.2.2 Rank density plots

plotRankDensity(
  rankData = rankedData[, "D_TGFb_R1", drop = F],
  upSet = tgfb_gs_up,
  downSet = tgfb_gs_dn,
  isInteractive = T
)
plotRankDensity(
  rankData = rankedData[, "D_Ctrl_R2", drop = F],
  upSet = tgfb_gs_up,
  downSet = tgfb_gs_dn,
  isInteractive = T
)

4 EMT landscape

Cursons et al, Combinatorial Targeting by MicroRNAs Co-ordinates Post-transcriptional Control of EMT, Cell Systems, 2018
Epithelial-Mesenchymal landscape from Cursons et al paper published in Cell Systems

(#fig:EMT landscape)Epithelial-Mesenchymal landscape from Cursons et al paper published in Cell Systems

4.1 Epithelial and Mesenchymal signatures’ scores

We read epithelial and mesenchymal signatures from Tan et al. The signatures have been extracted from the paper and are saved under the data folder.There are separate epithelial and mesenchymal signatures for cell lines and tumour data.

emt <- read.table("../data/Thiery_EMTsignature_both_tumour_cellLine_EntrezIDs.txt", header = T, sep = "\t")

epi <- emt$EntrezGene.ID[emt$epiMes_cellLine == "epi" ]
mes <- emt$EntrezGene.ID[emt$epiMes_cellLine == "mes" ]

epi <- epi[complete.cases(epi)]
mes <- mes[complete.cases(mes)]

We score samples using epithelial and mesenchymal signatures. These are expected up-regulated gene sets.

sigScore_epi <-
  simpleScore(
  rankData = rankedData,
  upSet = epi,
  subSamples = NULL,
  centerScore = T,
  dispersionFun = "MAD",
  knownDirection = T
  )

sigScore_mes <-
  simpleScore(
  rankData = rankedData,
  upSet = mes,
  subSamples = NULL,
  centerScore = T,
  dispersionFun = "MAD",
  knownDirection = T
  )

4.1.1 Signature landscape for cell lines

p1 <- plotScoreLandscape(
  scoredf1 = sigScore_epi,
  scoredf2 = sigScore_mes,
  scorenames = c("Epithelial", "Mesenchymal")
  )

projectScoreLandscape(
  plotObj = p1,
  scoredf1 = sigScore_epi,
  scoredf2 = sigScore_mes,
  annot = colData(tgfb_expr_10_se)$Treatment,
  annot_name = "Group",
  sampleLabels = colnames(rankedData),
  isInteractive = T
  )

Here, I load the epithelial and mesenchymal scores for the TCGA breast cancer data. I have already downloaded the TCGA data, calculated the RPKM values, and scored samples. To see how to do all these on the TCGA data, see the workflow paper by Bhuva et al.

TCGA_EMTscore <- readRDS('../data/TCGA_BrCr_EMT_score.RDS')

p2 <- plotScoreLandscape(
  scoredf1 = TCGA_EMTscore$TCGA_BrCr_Epi,
  scoredf2 = TCGA_EMTscore$TCGA_BrCr_Mes,
  scorenames = c("Epithelial", "Mesenchymal")
  )

projectScoreLandscape(
  plotObj = p2,
  scoredf1 = sigScore_epi[1:4, ],
  scoredf2 = sigScore_mes[1:4, ],
  annot = colData(tgfb_expr_10_se)$Treatment[1:4],
  annot_name = "Group",
  sampleLabels = colnames(rankedData)[1:4],
  isInteractive = T
  )

5 Interactive plot independent of singscore

5.1 Using ggplotly

You only need to pass a ggplot object to ggplotly to get the inteactivity. You could

## p2 is a ggplot object
ggplotly(p2)

We can also project the cell line scores on top of it.

annot <- colData(tgfb_expr_10_se)$Treatment[1:4]
annot_name <- "Group"
sampleLabels <- colnames(rankedData)[1:4]

newdata <- data.frame(
    'sc1' = sigScore_epi[1:4, "TotalScore"],
    'sc2' = sigScore_mes[1:4, "TotalScore"],
    'SampleLabel' = sampleLabels,
    'Class' = annot
  )

p3 <- p2 +
  geom_point(
    aes(text = SampleLabel, colour = Class),
    shape = 21,
    fill = 'white',
    size = 2,
    stroke = 2,
    data = newdata
  ) + 
  scale_colour_brewer(palette = 'Dark2') +
  labs(colour = annot_name)

ggplotly(p3)

6 Session information

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.14.0 DelayedArray_0.10.0        
##  [3] BiocParallel_1.18.0         matrixStats_0.54.0         
##  [5] GenomicRanges_1.36.0        GenomeInfoDb_1.20.0        
##  [7] singscore_1.4.0             GSEABase_1.46.0            
##  [9] graph_1.62.0                annotate_1.62.0            
## [11] XML_3.98-1.20               AnnotationDbi_1.46.0       
## [13] IRanges_2.18.1              S4Vectors_0.22.0           
## [15] Biobase_2.44.0              BiocGenerics_0.30.0        
## [17] plotly_4.9.0                ggplot2_3.2.0              
## [19] DT_0.7                      knitr_1.23                 
## [21] BiocStyle_2.12.0           
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.0             edgeR_3.26.5           tidyr_0.8.3           
##  [4] bit64_0.9-7            jsonlite_1.6           viridisLite_0.3.0     
##  [7] shiny_1.3.2            assertthat_0.2.1       highr_0.8             
## [10] BiocManager_1.30.4     blob_1.1.1             GenomeInfoDbData_1.2.1
## [13] yaml_2.2.0             pillar_1.4.2           RSQLite_2.1.1         
## [16] lattice_0.20-38        glue_1.3.1             limma_3.40.2          
## [19] digest_0.6.19          RColorBrewer_1.1-2     promises_1.0.1        
## [22] XVector_0.24.0         colorspace_1.4-1       plyr_1.8.4            
## [25] htmltools_0.3.6        httpuv_1.5.1           Matrix_1.2-17         
## [28] pkgconfig_2.0.2        bookdown_0.11          zlibbioc_1.30.0       
## [31] purrr_0.3.2            xtable_1.8-4           scales_1.0.0          
## [34] later_0.8.0            tibble_2.1.3           withr_2.1.2           
## [37] hexbin_1.27.3          lazyeval_0.2.2         mime_0.7              
## [40] magrittr_1.5           crayon_1.3.4           memoise_1.1.0         
## [43] evaluate_0.14          tools_3.6.0            data.table_1.12.2     
## [46] stringr_1.4.0          munsell_0.5.0          locfit_1.5-9.1        
## [49] compiler_3.6.0         rlang_0.4.0            grid_3.6.0            
## [52] RCurl_1.95-4.12        htmlwidgets_1.3        crosstalk_1.0.0       
## [55] labeling_0.3           bitops_1.0-6           rmarkdown_1.13        
## [58] gtable_0.3.0           DBI_1.0.0              reshape2_1.4.3        
## [61] R6_2.4.0               dplyr_0.8.2            bit_1.1-14            
## [64] stringi_1.4.3          Rcpp_1.0.1             tidyselect_0.2.5      
## [67] xfun_0.8